MACHINE LEARNING TUTORIALS

ML Tutorials: Exploratory Data Analysis (EDA)

It is dangerous to assume that the data you collected and prepared for your model is unbiased, normally distributed (when applicable), and free from errors. Exploratory Data Analysis (EDA), arguably the most boring yet most important aspect of the ML Lifecycle, helps verify that the data you're feeding your model meets your expectations and is of high quality. This tutorial covers the following learning objectives:

  • What is EDA?
  • Why is EDA Necessary?
  • What Methods are Commonly Used in EDA?

What is EDA?




Summary

  • Exploratory Data Analysis (EDA) is a method used by Data Scientists to analyze datasets and identify their main characteristics. It helps determine how best to transform a dataset to fit the algorithm you're basing your model on.
  • When collecting large amounts of data from disparate sources (e.g., databases, websites, APIs), you will likely find anomalies/outliers, fields that are stored as strings but should be numeric, and/or calculated fields that were computed incorrectly. EDA is used to surface these issues so they can be addressed.
  • Univariate EDA is used to analyze and interpret the data contained in a single variable/feature. This is helpful when a field in your dataset has high variability or when you suspect it contains outliers. Univariate EDA is commonly used on continuous, numeric fields.
  • Multivariate EDA is used to identify relationships between variables and identify basic trends in the data. This is commonly used to compare business logic with your dataset to verify the data is representative of the issue you're trying to solve or the opportunity you're trying to take advantage of.
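
As a rough illustration of the data issues described above, the sketch below converts a string field to numbers and flags possible outliers using only the Python standard library. The "price" values, the helper names to_number and iqr_outliers, and the 1.5 x IQR cutoff are all assumptions made for this example; the tutorial does not prescribe a specific outlier rule, and in practice a library such as pandas is often used for this kind of work.

```python
import statistics


def to_number(raw):
    """Convert a raw field to float; return None if it cannot be parsed."""
    try:
        # Strip whitespace and thousands separators before converting.
        return float(str(raw).replace(",", "").strip())
    except ValueError:
        return None


def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the quartiles.

    The 1.5 * IQR rule is one common convention, assumed here for illustration.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


# Hypothetical "price" field collected as strings from several sources.
raw_prices = ["10.5", "11", " 9.75", "1,000", "n/a", "10.25", "12", "10"]

parsed = [to_number(p) for p in raw_prices]
prices = [p for p in parsed if p is not None]
unparsed = [r for r, p in zip(raw_prices, parsed) if p is None]
outliers = iqr_outliers(prices)

print("unparseable entries:", unparsed)
print("possible outliers:", outliers)
```

A flagged value is not necessarily wrong. It is a prompt to check the source and decide whether to keep, correct, or drop it.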

What Methods are Commonly Used in EDA?




Summary

  • Common methods used in Univariate EDA include the following:
    • Histograms (continuous variables): These show the distribution of values in a variable. Values are grouped into "bins," each of which represents a defined range of values. You can change the number of bins to best represent your data. Histograms are used to visualize how close to normally distributed the values in the variable are.
    • Bar Charts (discrete variables): These show how frequently each category of a variable occurs. For example, for a variable called "color", you could create a bar chart showing how often each color appears in the dataset. Bar charts are used to visually compare category frequencies in your dataset with what occurs in the real world.
  • Common methods used in Multivariate EDA include the following:
    • Descriptive Statistics (numeric variables): Some tools allow you to take all the numeric variables present in your dataset and view Descriptive Statistics such as Min, Max, Mean, Percentiles, and Counts. These are used to compare the values in the table to what you'd expect to see in the real world.
    • Correlation Matrices (numeric variables): These are tables that show the relationship between each pair of variables. The Pearson correlation coefficient, the method most commonly used by Data Scientists, measures how strongly two variables are linearly related. It ranges from -1 to 1, where -1 means a perfect negative correlation (as variable A goes up, variable B goes down along an exact straight line), 1 means a perfect positive correlation (as variable A goes up, variable B goes up along an exact straight line), and 0 means no linear relationship.
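
The methods listed above can be sketched in a few lines of standard-library Python. This is a minimal illustration rather than a recommended toolchain: the sample "heights" and "colors" data and the helper names histogram_counts and pearson are hypothetical, and plotting is left out so that only the numbers behind each chart or table are computed.

```python
import math
import statistics
from collections import Counter


def histogram_counts(values, bins):
    """Count values in `bins` equal-width bins spanning min..max.

    Assumes the values are not all identical (the bin width would be zero).
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # The maximum value is placed in the last bin, not a new one.
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    return counts


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical sample data.
heights = [150, 155, 160, 165, 170, 175, 180, 185]
colors = ["red", "blue", "red", "green", "red", "blue"]

# Univariate: bin counts behind a histogram, category counts behind a bar chart.
hist = histogram_counts(heights, bins=4)
color_counts = Counter(colors)

# Descriptive statistics for one numeric variable.
summary = {
    "count": len(heights),
    "min": min(heights),
    "max": max(heights),
    "mean": statistics.fmean(heights),
    "median": statistics.median(heights),
}

# Pearson coefficient: B rises 10 units per unit of A, yet r is still 1,
# because the coefficient measures linear fit rather than slope.
a = [1, 2, 3, 4]
r_pos = pearson(a, [10, 20, 30, 40])
r_neg = pearson(a, [8, 6, 4, 2])

print(hist, color_counts.most_common(), summary, r_pos, r_neg)
```

Changing the bins argument is the quickest way to see how bin choice affects the apparent shape of a distribution.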